Statistical Phrases in Automated Text Categorization
Abstract
In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call an n-gram a set tk of n word stems, and we say that tk occurs in a document dj when a sequence of words appears in dj that, after stop word removal and stemming, consists exactly of the n stems in tk, in some order. Previous research has investigated the use of n-grams (or some variant of them) in the context of specific learning algorithms, and has therefore not yielded general answers about their usefulness for TC. Here we assess their usefulness independently of any specific learning algorithm. We do so by applying feature selection to the pool of all α-grams (α ≤ n) and checking how many n-grams score high enough to be selected among the top σ α-grams. We report the results of our experiments, performed on the Reuters-21578 standard TC benchmark using several feature selection functions and varying values of σ. We also report the results of actually using the selected n-grams in a linear classifier induced by means of the Rocchio method.

Categories and subject descriptors: H.3.3 [Information storage and retrieval]: Information search and retrieval - Information filtering; H.3.3 [Information storage and retrieval]: Systems and software - Performance evaluation (efficiency and effectiveness); I.2.3 [Artificial Intelligence]: Learning - Induction
Terms: Algorithms, Experimentation, Theory
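The n-gram definition and the selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop word list, the `toy_stem` function, and the use of document frequency as the scoring function are all assumptions made for illustration (the abstract does not name the feature selection functions used), and `alpha_grams` and `top_sigma` are hypothetical names.

```python
from collections import Counter

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "to"}


def toy_stem(word):
    """Crude stand-in for a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def alpha_grams(text, n):
    """Return every alpha-gram (alpha <= n) occurring in `text`.

    An alpha-gram is treated as a SET of alpha distinct stems, so each
    window of consecutive stems is stored as a sorted tuple and word
    order inside the window does not matter.
    """
    stems = [toy_stem(w) for w in text.lower().split() if w not in STOP_WORDS]
    grams = set()
    for alpha in range(1, n + 1):
        for i in range(len(stems) - alpha + 1):
            window = stems[i:i + alpha]
            if len(set(window)) == alpha:  # skip windows with repeated stems
                grams.add(tuple(sorted(window)))
    return grams


def top_sigma(docs, n, sigma):
    """Rank the pooled alpha-grams and keep the top `sigma`.

    Document frequency is used as the score purely as a placeholder
    for whichever feature selection function is being studied.
    """
    df = Counter()
    for text in docs:
        df.update(alpha_grams(text, n))
    ranked = sorted(df, key=lambda g: (-df[g], g))  # ties broken alphabetically
    return ranked[:sigma]


docs = ["rates of interest", "interest rate", "bank"]
selected = top_sigma(docs, n=2, sigma=3)
# Number of true multi-word grams that made it into the top sigma.
multiword_selected = sum(len(g) > 1 for g in selected)
```

Under these assumptions, "rates of interest" and "interest rate" both yield the same 2-gram, because stop word removal and stemming reduce both to the stems "interest" and "rate" and the order of the stems is ignored. Counting how many selected entries have more than one stem mirrors the check described above: how many n-grams score high enough to enter the top σ α-grams.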
Related resources
Feature Selection and Feature Extraction for Text Categorization
The effect of selecting varying numbers and kinds of features for use in predicting category membership was investigated on the Reuters and MUC-3 text categorization data sets. Good categorization performance was achieved using a statistical classifier and a proportional assignment strategy. The optimal feature set size for word-based indexing was found to be surprisingly low (10 to 15 features...
A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization
In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call n-gram a set gk of n word stems, and we say that gk occurs in a document dj when a sequence of words appears in dj that, after stop word removal and stemming, consists exactly of the n stems in gk, in some order. Previous researches have investigated the use of n-grams (or some varia...
A New Text Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need effective ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...
Resolving Ambiguous Preposition Phrase Using Genetic Algorithm
Text mining refers to the process of discovering interesting and non-trivial patterns or knowledge embedded in unstructured text documents from a fixed domain. It is also known as knowledge discovery from text databases. Text mining tasks include text categorization, text clustering, concept/entity extraction, document summarization and entity relation modelling. Extracting concept/fact from th...
Genre Analysis and the Automated Extraction of Arguments from Student Essays
A full understanding of text is out of reach of current human language technology. However, a shallow Natural Language Processing (NLP) approach can be used to provide automated help in the assessment of essays: our approach uses genre, cue phrases and a set of patterns. Cue phrases, with their associated semantics, are used in conjunction with patterns to identify categories of argumentation p...